Predict House Prices Using Linear Regression¶
Data Dictionary¶
- SalePrice: the property's sale price in dollars. This is the target variable that you're trying to predict.
- MSSubClass: The building class
- MSZoning: The general zoning classification
- LotFrontage: Linear feet of street connected to property
- LotArea: Lot size in square feet
- Street: Type of road access
- Alley: Type of alley access
- LotShape: General shape of property
- LandContour: Flatness of the property
- Utilities: Type of utilities available
- LotConfig: Lot configuration
- LandSlope: Slope of property
- Neighborhood: Physical locations within Ames city limits
- Condition1: Proximity to main road or railroad
- Condition2: Proximity to main road or railroad (if a second is present)
- BldgType: Type of dwelling
- HouseStyle: Style of dwelling
- OverallQual: Overall material and finish quality
- OverallCond: Overall condition rating
- YearBuilt: Original construction date
- YearRemodAdd: Remodel date
- RoofStyle: Type of roof
- RoofMatl: Roof material
- Exterior1st: Exterior covering on house
- Exterior2nd: Exterior covering on house (if more than one material)
- MasVnrType: Masonry veneer type
- MasVnrArea: Masonry veneer area in square feet
- ExterQual: Exterior material quality
- ExterCond: Present condition of the material on the exterior
- Foundation: Type of foundation
- BsmtQual: Height of the basement
- BsmtCond: General condition of the basement
- BsmtExposure: Walkout or garden level basement walls
- BsmtFinType1: Quality of basement finished area
- BsmtFinSF1: Type 1 finished square feet
- BsmtFinType2: Quality of second finished area (if present)
- BsmtFinSF2: Type 2 finished square feet
- BsmtUnfSF: Unfinished square feet of basement area
- TotalBsmtSF: Total square feet of basement area
- Heating: Type of heating
- HeatingQC: Heating quality and condition
- CentralAir: Central air conditioning
- Electrical: Electrical system
- 1stFlrSF: First Floor square feet
- 2ndFlrSF: Second floor square feet
- LowQualFinSF: Low quality finished square feet (all floors)
- GrLivArea: Above grade (ground) living area square feet
- BsmtFullBath: Basement full bathrooms
- BsmtHalfBath: Basement half bathrooms
- FullBath: Full bathrooms above grade
- HalfBath: Half baths above grade
- Bedroom: Number of bedrooms above basement level
- Kitchen: Number of kitchens
- KitchenQual: Kitchen quality
- TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
- Functional: Home functionality rating
- Fireplaces: Number of fireplaces
- FireplaceQu: Fireplace quality
- GarageType: Garage location
- GarageYrBlt: Year garage was built
- GarageFinish: Interior finish of the garage
- GarageCars: Size of garage in car capacity
- GarageArea: Size of garage in square feet
- GarageQual: Garage quality
- GarageCond: Garage condition
- PavedDrive: Paved driveway
- WoodDeckSF: Wood deck area in square feet
- OpenPorchSF: Open porch area in square feet
- EnclosedPorch: Enclosed porch area in square feet
- 3SsnPorch: Three season porch area in square feet
- ScreenPorch: Screen porch area in square feet
- PoolArea: Pool area in square feet
- PoolQC: Pool quality
- Fence: Fence quality
- MiscFeature: Miscellaneous feature not covered in other categories
- MiscVal: Dollar value of miscellaneous feature
- MoSold: Month Sold
- YrSold: Year Sold
- SaleType: Type of sale
- SaleCondition: Condition of sale
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
sns.set_context('talk')
%matplotlib inline
data = pd.read_csv('Ames_Housing_Sales.csv', sep=',')
Check the data types
data.dtypes.value_counts()
object     43
float64    21
int64      16
dtype: int64
Check the categorical features
data.select_dtypes('object').describe().T
Feature | count | unique | top | freq |
---|---|---|---|---|
Alley | 1379 | 3 | None | 1297 |
BldgType | 1379 | 5 | 1Fam | 1166 |
BsmtCond | 1379 | 4 | TA | 889 |
BsmtExposure | 1379 | 5 | No | 582 |
BsmtFinType1 | 1379 | 6 | None | 426 |
BsmtFinType2 | 1379 | 7 | Unf | 790 |
BsmtQual | 1379 | 5 | TA | 442 |
CentralAir | 1379 | 2 | Y | 1310 |
Condition1 | 1379 | 9 | Norm | 1195 |
Condition2 | 1379 | 8 | Norm | 1365 |
Electrical | 1379 | 5 | SBrkr | 1273 |
ExterCond | 1379 | 4 | TA | 1221 |
ExterQual | 1379 | 4 | TA | 833 |
Exterior1st | 1379 | 14 | VinylSd | 498 |
Exterior2nd | 1379 | 16 | VinylSd | 487 |
Fence | 1379 | 5 | None | 1114 |
FireplaceQu | 1379 | 6 | None | 618 |
Foundation | 1379 | 6 | PConc | 633 |
Functional | 1379 | 7 | Typ | 1287 |
GarageCond | 1379 | 5 | TA | 1326 |
GarageFinish | 1379 | 3 | Unf | 605 |
GarageQual | 1379 | 5 | TA | 1311 |
GarageType | 1379 | 6 | Attchd | 870 |
Heating | 1379 | 6 | GasA | 1353 |
HeatingQC | 1379 | 5 | Ex | 720 |
HouseStyle | 1379 | 8 | 1Story | 686 |
KitchenQual | 1379 | 4 | TA | 676 |
LandContour | 1379 | 4 | Lvl | 1244 |
LandSlope | 1379 | 3 | Gtl | 1306 |
LotConfig | 1379 | 5 | Inside | 988 |
LotShape | 1379 | 4 | Reg | 861 |
MSZoning | 1379 | 5 | RL | 1101 |
MasVnrType | 1379 | 4 | None | 797 |
MiscFeature | 1379 | 5 | None | 1328 |
Neighborhood | 1379 | 25 | NAmes | 219 |
PavedDrive | 1379 | 3 | Y | 1293 |
PoolQC | 1379 | 4 | None | 1372 |
RoofMatl | 1379 | 8 | CompShg | 1354 |
RoofStyle | 1379 | 6 | Gable | 1070 |
SaleCondition | 1379 | 6 | Normal | 1137 |
SaleType | 1379 | 9 | WD | 1194 |
Street | 1379 | 2 | Pave | 1374 |
Utilities | 1379 | 2 | AllPub | 1378 |
There are no missing values in any of the columns
null_check = data.isna().sum()
null_check[null_check > 0]
Series([], dtype: int64)
2. Goals¶
Predict the house price based on its characteristics
num_cols = data.select_dtypes('number').columns
print('Total numerical features: ', len(num_cols))
num_cols
Total numerical features: 37
Index(['1stFlrSF', '2ndFlrSF', '3SsnPorch', 'BedroomAbvGr', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtUnfSF', 'EnclosedPorch', 'Fireplaces', 'FullBath', 'GarageArea', 'GarageCars', 'GarageYrBlt', 'GrLivArea', 'HalfBath', 'KitchenAbvGr', 'LotArea', 'LotFrontage', 'LowQualFinSF', 'MSSubClass', 'MasVnrArea', 'MiscVal', 'MoSold', 'OpenPorchSF', 'OverallCond', 'OverallQual', 'PoolArea', 'ScreenPorch', 'TotRmsAbvGrd', 'TotalBsmtSF', 'WoodDeckSF', 'YearBuilt', 'YearRemodAdd', 'YrSold', 'SalePrice'], dtype='object')
num_cols = data.drop(columns=['SalePrice']).select_dtypes('number').columns
%%time
fig, ax = plt.subplots(9, 4, figsize=(16, 27))
for i, col in enumerate(num_cols):
    plt.subplot(9, 4, i+1)
    sns.distplot(data[col])
plt.tight_layout()
plt.show()
Wall time: 2min 37s
Normality test¶
Run a normality test on each numerical feature using D'Agostino's $K^2$ test:
$H_0$: the data comes from a normal distribution
$H_1$: the data does not come from a normal distribution
We reject $H_0$ when the p-value is below $\alpha = 0.05$.
from scipy.stats.mstats import normaltest
for col in num_cols:
    tstat, pval = normaltest(data[col].values)
    print(f'{col}: tstat = {round(tstat, 3)}, p-value = {round(pval, 3)}')
1stFlrSF: tstat = 443.049, p-value = 0.0
2ndFlrSF: tstat = 157.049, p-value = 0.0
3SsnPorch: tstat = 2187.193, p-value = 0.0
BedroomAbvGr: tstat = 43.992, p-value = 0.0
BsmtFinSF1: tstat = 603.908, p-value = 0.0
BsmtFinSF2: tstat = 1202.79, p-value = 0.0
BsmtFullBath: tstat = 2325.587, p-value = 0.0
BsmtHalfBath: tstat = 1108.571, p-value = 0.0
BsmtUnfSF: tstat = 156.01, p-value = 0.0
EnclosedPorch: tstat = 944.006, p-value = 0.0
Fireplaces: tstat = 73.538, p-value = 0.0
FullBath: tstat = 147.685, p-value = 0.0
GarageArea: tstat = 154.63, p-value = 0.0
GarageCars: tstat = 10.884, p-value = 0.004
GarageYrBlt: tstat = 98.369, p-value = 0.0
GrLivArea: tstat = 433.057, p-value = 0.0
HalfBath: tstat = 2664.983, p-value = 0.0
KitchenAbvGr: tstat = 1389.599, p-value = 0.0
LotArea: tstat = 2429.88, p-value = 0.0
LotFrontage: tstat = 939.397, p-value = 0.0
LowQualFinSF: tstat = 2258.578, p-value = 0.0
MSSubClass: tstat = 311.131, p-value = 0.0
MasVnrArea: tstat = 794.013, p-value = 0.0
MiscVal: tstat = 3391.976, p-value = 0.0
MoSold: tstat = 24.377, p-value = 0.0
OpenPorchSF: tstat = 682.274, p-value = 0.0
OverallCond: tstat = 162.265, p-value = 0.0
OverallQual: tstat = 18.636, p-value = 0.0
PoolArea: tstat = 2635.073, p-value = 0.0
ScreenPorch: tstat = 1148.437, p-value = 0.0
TotRmsAbvGrd: tstat = 104.288, p-value = 0.0
TotalBsmtSF: tstat = 622.45, p-value = 0.0
WoodDeckSF: tstat = 389.973, p-value = 0.0
YearBuilt: tstat = 94.216, p-value = 0.0
YearRemodAdd: tstat = 1317.444, p-value = 0.0
YrSold: tstat = 1063.988, p-value = 0.0
From the normality test above, none of the features is normally distributed, as shown by p-values below the $\alpha$ of 0.05. Let's check feature skewness and transform features whose skewness is > 0.75 or < -0.75.
skew_check = data[num_cols].skew()
skew_check[(skew_check > 0.75) | (skew_check < -0.75)]
1stFlrSF          1.390283
2ndFlrSF          0.786109
3SsnPorch        10.007116
BsmtFinSF1        1.678351
BsmtFinSF2        4.194649
BsmtHalfBath      3.917582
BsmtUnfSF         0.927963
EnclosedPorch     3.213038
GarageArea        0.811037
GrLivArea         1.411296
KitchenAbvGr      5.093935
LotArea          12.013038
LotFrontage       2.712348
LowQualFinSF     10.712587
MSSubClass        1.379754
MasVnrArea        2.601035
MiscVal          24.841008
OpenPorchSF       2.246211
OverallCond       0.866698
PoolArea         14.406273
ScreenPorch       3.987031
TotalBsmtSF       1.621602
WoodDeckSF        1.504088
dtype: float64
skew_cols = skew_check[(skew_check > 0.75) | (skew_check < -0.75)].index
Log Transformation¶
Check the feature distribution after log transformation
field = "LotArea"
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
data[field].hist(ax=ax[0])
data[field].apply(np.log1p).hist(ax=ax[1])
ax[0].set(title='before np.log1p', ylabel='frequency', xlabel='value')
ax[1].set(title='after np.log1p', ylabel='frequency', xlabel='value')
fig.suptitle('Field "{}"'.format(field), size=18)
plt.tight_layout()
plt.show()
# pd.options.mode.chained_assignment = None
df_train = data.copy()
for col in skew_cols:
    df_train[col] = np.log1p(df_train[col])
After log transformation, far fewer features have high skewness.
skew_check_2 = df_train.skew()
skew_check_2[(skew_check_2 > 0.75) | (skew_check_2 < -0.75)]
3SsnPorch         7.506825
BsmtFinSF2        2.466656
BsmtHalfBath      3.824348
BsmtUnfSF        -2.246939
EnclosedPorch     2.205690
KitchenAbvGr      4.966928
LowQualFinSF      8.572700
MiscVal           5.174866
PoolArea         13.953877
ScreenPorch       3.036266
TotalBsmtSF      -5.496282
SalePrice         1.935362
dtype: float64
Distributions of the skewed features after log transformation
%%time
fig, ax = plt.subplots(6, 4, figsize=(16, 18))
for i, col in enumerate(skew_cols):
    plt.subplot(6, 4, i+1)
    sns.distplot(df_train[col])
ax[-1, -1].axis('off')
plt.tight_layout()
plt.show()
Wall time: 1min 49s
4. Modeling¶
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.model_selection import cross_val_score, StratifiedKFold, cross_validate, KFold, learning_curve, ShuffleSplit
from sklearn.metrics import classification_report, mean_squared_error, accuracy_score, r2_score
One-Hot Encoding for Categorical Features¶
cat_cols = df_train.dtypes[df_train.dtypes == object] # filtering by string categoricals
cat_cols = cat_cols.index # list of categorical fields
for col in cat_cols:
    df_train[col] = pd.Categorical(df_train[col])
df_train = pd.get_dummies(df_train, columns=cat_cols)
Split the Dataset¶
feature_cols = [x for x in df_train.columns if x != 'SalePrice']
X = df_train[feature_cols]
y = df_train['SalePrice']
X_train_val, X_test, y_train_val, y_test = train_test_split(X,y, random_state=26)
def rmse(ytrue, ypredicted):
    return np.sqrt(mean_squared_error(ytrue, ypredicted))
def cross_validation_multi(estimator, X=X_train_val):
    kfold = KFold(shuffle=True, random_state=26, n_splits=5)
    cv_score = cross_validate(estimator=estimator, X=X, y=y_train_val,
                              scoring=['neg_mean_squared_error', 'r2'], cv=kfold, n_jobs=4)
    return cv_score
Baseline model¶
- Mean value
Predict SalePrice as its mean value for every observation. This model performs very poorly, as the $R^2$ value shows.
y_mean = np.zeros(y_test.shape)
y_mean[:] = y.mean()
mean_rmse = rmse(y_test, y_mean).round(2)
mean_r2 = r2_score(y_test, y_mean).round(3)
print(f'RMSE: {mean_rmse}')
print(f'r2: {mean_r2}')
RMSE: 80251.42
r2: -0.001
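Equivalently, scikit-learn's DummyRegressor can produce this mean-value baseline. The sketch below is not part of the original pipeline, and it uses the training-set mean rather than the mean of the full dataset, so the numbers can differ slightly.
from sklearn.dummy import DummyRegressor
dummy = DummyRegressor(strategy='mean')  # always predicts the mean of y_train_val
dummy.fit(X_train_val, y_train_val)
y_pred_dummy = dummy.predict(X_test)
print(f'RMSE: {rmse(y_test, y_pred_dummy).round(2)}')
print(f'r2: {r2_score(y_test, y_pred_dummy).round(3)}')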
- Linear Regression Baseline Model
This is a LinearRegression estimator without regularization and without feature scaling.
baseline_pl = LinearRegression()
baseline_pl.fit(X_train_val, y_train_val)
y_pred_baseline = baseline_pl.predict(X_test)
print('\nEvaluation on test set')
baseline_rmse = rmse(y_test, y_pred_baseline).round(2)
baseline_r2 = r2_score(y_test, y_pred_baseline).round(3)
print(f'RMSE: {baseline_rmse}')
print(f'r2: {baseline_r2}')
Evaluation on test set
RMSE: 70953.37
r2: 0.218
Prepare a function to plot cross-validation results¶
The alpha values used start from 0.01, with $\alpha_{i+1} \approx 3\alpha_i$.
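Just to illustrate the spacing (a sketch, not part of the original code): the grid is the values {1, 3} at each power of ten, so each step multiplies alpha by roughly 3.
alphas = np.outer(10.0 ** np.arange(-2, 2), [1, 3]).ravel()
print(alphas)  # 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30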
def plot_cv(estimator, X=X, alphas=np.array([0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30])):
    rmse_scores = []
    r2_scores = []
    for alpha in alphas:
        cvs = cross_validation_multi(X=X, estimator=estimator(alpha=alpha))
        rmse_score = np.sqrt(cvs['test_neg_mean_squared_error'].mean()*-1)
        rmse_scores.append(rmse_score)
        r2_mean = cvs['test_r2'].mean()  # avoid shadowing sklearn's r2_score
        r2_scores.append(r2_mean)
        print(f'alpha ({alpha}): RMSE {round(rmse_score,2)}, r2 {round(r2_mean,3)}')
    plt.subplots(1, 2, figsize=(10, 3))
    plt.subplot(1, 2, 1)
    sns.lineplot(x=alphas, y=rmse_scores, label='RMSE')
    plt.xlabel('alpha')
    plt.subplot(1, 2, 2)
    g = sns.lineplot(x=alphas, y=r2_scores, label='r2')
    g.set_xlabel('alpha')
    plt.legend()
Ridge Regularization¶
We'll try a linear regression model with an L2 penalty (Ridge), using different regularization strengths. The alpha parameter is the regularization strength. The lowest RMSE is obtained at alpha = 10.
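For reference, Ridge adds an L2 penalty on the coefficients to the least-squares objective, roughly minimizing
$\|y - Xw\|_2^2 + \alpha \|w\|_2^2$
so a larger alpha shrinks the coefficients more strongly toward zero.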
plot_cv(X=X_train_val, estimator=Ridge)
plt.suptitle('L2 Penalty', size=18)
plt.tight_layout()
plt.show()
alpha (0.01): RMSE 37150.8, r2 0.776
alpha (0.03): RMSE 36389.51, r2 0.785
alpha (0.1): RMSE 35651.06, r2 0.794
alpha (0.3): RMSE 34906.52, r2 0.803
alpha (1.0): RMSE 34252.29, r2 0.811
alpha (3.0): RMSE 33972.95, r2 0.815
alpha (10.0): RMSE 33902.15, r2 0.817
alpha (30.0): RMSE 34268.29, r2 0.813
Lasso Regularization¶
A linear regression model with an L1 penalty (Lasso), using different regularization strengths. The alpha parameter is the regularization strength. Unlike the previous model with Ridge regularization, Lasso gives a wider performance range over the same alpha values: with Ridge the RMSE ranges between roughly 34,000 and 37,000, while with Lasso it ranges from roughly 32,000 to 44,000.
For Lasso regularization, RMSE is lowest at alpha = 30, while lower alpha values give higher error.
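For reference, Lasso uses an L1 penalty instead; in scikit-learn's formulation it minimizes roughly
$\frac{1}{2n} \|y - Xw\|_2^2 + \alpha \|w\|_1$
(with $n$ the number of samples), which can set some coefficients exactly to zero, effectively performing feature selection.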
plot_cv(X=X_train_val, estimator=Lasso)
plt.suptitle('L1 Penalty', size=18)
plt.show()
alpha (0.01): RMSE 44633.33, r2 0.666
alpha (0.03): RMSE 44305.09, r2 0.671
alpha (0.1): RMSE 43186.73, r2 0.689
alpha (0.3): RMSE 40308.07, r2 0.731
alpha (1.0): RMSE 37124.13, r2 0.775
alpha (3.0): RMSE 36507.55, r2 0.783
alpha (10.0): RMSE 33567.14, r2 0.819
alpha (30.0): RMSE 32817.88, r2 0.828
ElasticNet Regularization¶
This model combines the Lasso and Ridge penalties. From the graph below, the larger the alpha, the higher the RMSE. The best alpha for ElasticNet regularization is 0.01.
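For reference, ElasticNet mixes both penalties; with mixing parameter $\rho$ (scikit-learn's l1_ratio, 0.5 by default) it minimizes roughly
$\frac{1}{2n} \|y - Xw\|_2^2 + \alpha \rho \|w\|_1 + \frac{\alpha (1 - \rho)}{2} \|w\|_2^2$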
plot_cv(X=X_train_val, estimator=ElasticNet)
plt.suptitle('Combined L1 - L2 Penalty with Transformation', size=18)
plt.show()
alpha (0.01): RMSE 33928.71, r2 0.816
alpha (0.03): RMSE 33930.6, r2 0.817
alpha (0.1): RMSE 34488.41, r2 0.811
alpha (0.3): RMSE 35620.32, r2 0.798
alpha (1.0): RMSE 37851.58, r2 0.772
alpha (3.0): RMSE 41927.75, r2 0.721
alpha (10.0): RMSE 48942.26, r2 0.618
alpha (30.0): RMSE 55683.52, r2 0.503
Comparing the Models¶
Each model (Ridge, Lasso, and ElasticNet) is then evaluated on the held-out test set using its best alpha. The best alpha values for Ridge, Lasso, and ElasticNet are 10, 30, and 0.01 respectively.
f = plt.figure(figsize=(6,6))
ax = plt.axes()
labels = ['Ridge', 'Lasso', 'ElasticNet']
models = [Ridge(alpha=10), Lasso(alpha=30), ElasticNet(alpha=0.01)]
rmse_scores = []
r2_scores = []
for mod, lab in zip(models, labels):
    mod.fit(X_train_val, y_train_val)
    y_pred = mod.predict(X_test)
    rmse_scores.append(rmse(y_test, y_pred).round(2))
    r2_scores.append(r2_score(y_test, y_pred))
    ax.plot(y_test, y_pred,
            marker='o', ls='', ms=3.0, label=lab)
leg = plt.legend(frameon=True)
leg.get_frame().set_edgecolor('black')
leg.get_frame().set_linewidth(1.0)
ax.set(xlabel='Actual Price',
ylabel='Predicted Price',
title='Linear Regression Results');
rmse_vals = [mean_rmse, baseline_rmse]
rmse_vals.extend(rmse_scores)
r2_vals = [mean_r2, baseline_r2]
r2_vals.extend(r2_scores)
labels = ['Mean value baseline', 'Linear baseline', 'Ridge', 'Lasso', 'ElasticNet']
rmse_df = pd.DataFrame(zip(rmse_vals,r2_vals), columns=['RMSE','r2'], index = labels)
rmse_df
Model | RMSE | r2 |
---|---|---|
Mean value baseline | 80251.42 | -0.001000 |
Linear baseline | 70953.37 | 0.218000 |
Ridge | 30507.26 | 0.855407 |
Lasso | 30498.22 | 0.855492 |
ElasticNet | 30742.64 | 0.853167 |
5. Best Model¶
All of the regularized models performed better than the baseline models, with markedly lower RMSE and higher $R^2$.
Our ultimate goal is to accurately predict a house's sale price based on its characteristics. For this reason, we choose the Lasso-regularized model with alpha = 30, since it gives the lowest RMSE and the highest $R^2$ of all the models above.
When we evaluate this model on the test dataset, we get an RMSE of roughly $ 30498.
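For completeness, a minimal sketch of that evaluation (the same fit already performed in the comparison above; best_model and y_pred_best are new names used only here):
best_model = Lasso(alpha=30)
best_model.fit(X_train_val, y_train_val)
y_pred_best = best_model.predict(X_test)
print(f'RMSE: {rmse(y_test, y_pred_best).round(2)}')   # about 30498, matching the table above
print(f'r2: {r2_score(y_test, y_pred_best).round(3)}') # about 0.855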
def plot_learning_curve_scoring(estimator, title, X, y, axes=None, ylim=None, cv=None, scoring=None,
                                n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    if axes is None:  # only create a new figure when no axes object is passed in
        _, axes = plt.subplots()
    axes.set_title(title)
    if ylim is not None:
        axes.set_ylim(*ylim)
    axes.set_xlabel("Training examples")
    axes.set_ylabel("Score")
    train_sizes, train_scores, test_scores, fit_times, _ = \
        learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs,
                       train_sizes=train_sizes, scoring=scoring,
                       return_times=True)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    # Plot learning curve
    axes.grid()
    axes.fill_between(train_sizes, train_scores_mean - train_scores_std,
                      train_scores_mean + train_scores_std, alpha=0.1,
                      color="r")
    axes.fill_between(train_sizes, test_scores_mean - test_scores_std,
                      test_scores_mean + test_scores_std, alpha=0.1,
                      color="g")
    axes.plot(train_sizes, train_scores_mean, 'o-', color="r",
              label="Training score")
    axes.plot(train_sizes, test_scores_mean, 'o-', color="g",
              label="Cross-validation score")
    axes.legend(loc="best")
    return plt
Learning Curve¶
The learning curve below shows that our model still suffers from overfitting, since there is a considerably large gap between the training score and the cross-validation score. However, adding more training data will not solve the issue, since both scores have already plateaued at around 800 training examples.
%%time
title = r"Learning Curves (Lasso)"
model = Lasso(alpha=30)
cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
plot_learning_curve_scoring(model, title, X_train_val, y_train_val, #axes=axes[:, 2], ylim=(0.7, 1.01),
cv=cv, scoring='neg_mean_squared_error', n_jobs=4)
plt.show()
Wall time: 26.8 s
Top Predictors¶
These are the 20 top predictors of SalePrice, ranked by the absolute value of their Lasso coefficients.
lasso_best = Lasso(alpha=30)
lasso_best.fit(X_train_val, y_train_val)
feature_coef = pd.DataFrame({'coef':lasso_best.coef_}, index=X.columns)
coef_highest = feature_coef.abs().sort_values('coef', ascending=False).head(20).index
fig,ax = plt.subplots(1,1,figsize=(6,12))
feature_coef.loc[coef_highest].sort_values('coef').plot.barh(ax=ax)
plt.show()
feature_coef.loc[coef_highest].sort_values('coef', ascending=False)
Feature | coef |
---|---|
PoolQC_Ex | 114557.525577 |
RoofMatl_WdShngl | 85713.331270 |
SaleType_New | 57784.427882 |
OverallCond | 45421.377510 |
Neighborhood_StoneBr | 43957.617029 |
Neighborhood_NoRidge | 42212.280004 |
1stFlrSF | 40948.234407 |
GarageQual_Ex | 30417.683992 |
KitchenQual_Ex | 27629.658234 |
BsmtQual_Ex | 24572.664475 |
GrLivArea | 23980.320676 |
Exterior2nd_ImStucc | 23856.381535 |
ExterQual_Ex | 20465.314198 |
Neighborhood_NWAmes | -16757.809223 |
Neighborhood_Mitchel | -16962.617660 |
Neighborhood_Gilbert | -18842.131156 |
MSZoning_C (all) | -19875.365838 |
KitchenAbvGr | -41235.746670 |
SaleCondition_Partial | -41445.785905 |
RoofMatl_ClyTile | -361571.998900 |
print(f'Of {len(feature_coef)} coefficients, {len(feature_coef[feature_coef.coef!=0])} are non-zero with Lasso.')
Of 294 coefficients, 173 are non-zero with Lasso.
The best predictor is PoolQC_Ex, which is the PoolQC feature taking the value Ex. Based on the more detailed data dictionary, Ex means 'Excellent'. This is expected, since a house with a pool should command a higher price than one without, and an excellent-quality pool pushes the price even higher. Assuming all other characteristics stay the same, a house with an excellent-quality pool is predicted to sell for about $ 114557 more than a house without an excellent-quality pool.
6. Conclusion and Suggestion¶
We have tried Ridge, Lasso, and ElasticNet regularization with different regularization strengths. Our best regularized linear regression model performed better than the baseline models, and we evaluated it on the test set. It turns out that whether a house has an excellent-quality pool (PoolQC_Ex) is the best predictor of its sale price.
Not all of the numerical features we used come from a normal distribution. However, log transformation brings many of the skewed features closer to a normal distribution.
Our model suffers from overfitting. Eliminating some highly correlated features might be beneficial, for example starting from the sketch below.
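A possible starting point for that suggestion (a sketch only; the 0.8 threshold is an assumption, not something established in the analysis above):
# List pairs of numerical features with absolute correlation above 0.8
corr = df_train[num_cols].corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))  # keep each pair once
high_pairs = upper.stack().sort_values(ascending=False)
print(high_pairs[high_pairs > 0.8])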